Abstract

We extend existing methods for automatic sentence boundary
detection by leveraging multiple recognizer hypotheses in order
to provide robustness to speech recognition errors. For each
hypothesized word sequence, an HMM is used to estimate the
posterior probability of a sentence boundary at each word
boundary. The hypotheses are combined using confusion networks
to determine the overall most likely events. Experiments show
improved detection of sentences for conversational telephone
speech, though results are mixed for broadcast news.

1  Introduction 

The output of most current automatic speech recognition systems
is an unstructured sequence of words. Additional information such
as sentence boundaries and speaker labels is useful for improving
readability and can provide structure relevant to subsequent
language processing, including parsing, topic segmentation and
summarization. In this study, we focus on identifying sentence
boundaries using word-based and prosodic cues, and in particular
we develop a method that leverages additional information
available from multiple recognizer hypotheses.

Multiple hypotheses are helpful because the single best
recognizer output still has many errors even for state-of-the-art
systems. For conversational telephone speech (CTS), word error
rates range from 20% to 30%, and for broadcast news (BN) word
error rates are 10-15%. These errors limit the effectiveness of
sentence boundary prediction, because they introduce incorrect
words to the word stream. Sentence boundary detection error rates
on a baseline system increased by 50% relative for CTS when
moving from the reference to the automatic speech recognition
condition, while for BN error rates increased by about 20%
relative (Liu et al., 2003). Including additional recognizer
hypotheses allows for alternative word choices to inform sentence
boundary prediction.

To integrate the information from different alternatives, we
first predict sentence boundaries in each hypothesized word
sequence, using an HMM structure that integrates prosodic
features in a decision tree with hidden event language modeling.
To facilitate merging predictions from multiple hypotheses, we
represent each hypothesis as a confusion network, with
confidences for sentence predictions from a baseline system. The
final prediction is based on a combination of predictions from
individual hypotheses, each weighted by the recognizer posterior
for that hypothesis.

Our methods build on related work in sentence boundary detection
and confusion networks, as described in Section 2, and a baseline
system and task domain reviewed in Section 3. Our approach
integrates prediction on multiple recognizer hypotheses using
confusion networks, as outlined in Section 4. Experimental
results are detailed in Section 5, and the main conclusions of
this work are summarized in Section 6.

2  Related  Work 

2.1  Sentence  Boundary  Detection 

Previous work on sentence boundary detection for automatically
recognized words has focused on the prosodic features and words
of the single best recognizer output (Shriberg et al., 2000).
That system had an HMM structure that integrated hidden event
language modeling with prosodic decision tree outputs (Breiman
et al., 1984). The HMM states predicted at each word boundary
consisted of either a sentence or non-sentence boundary
classification, each of which received a confidence score.
Improvements to the hidden event framework have included
interpolation of multiple language models (Liu et al., 2003).

A related model has been used to investigate punctuation
prediction for multiple hypotheses in a speech recognition system
(Kim and Woodland, 2001). That system found improvement in
punctuation prediction when rescoring using the classification
tree prosodic feature model, but it also introduced a small
increase in word error rate. More recent work has also
implemented a similar model, but used prosodic features in a
neural net instead of a decision tree (Srivastava and Kubala,
2003). A maximum entropy model that included pause information
was used by Huang and Zweig (2002). Both finite-state models and
neural nets have been investigated for prosodic and lexical
feature combination by Christensen et al. (2001).

2.2  Confusion  Networks 

Confusion networks are a compacted representation of word
lattices that have strictly ordered word hypothesis slots (Mangu
et al., 2000). The complexity of lattice representations is
reduced to a simpler form that maintains all possible paths from
the lattice (and more), but transforms the space to a series of
slots which each have word hypotheses (and null arcs) derived
from the lattice and associated posterior probabilities.
Confusion networks may also be constructed from an N-best list,
which is the case for these experiments. Confusion networks are
used to optimize word error rate (WER) by selecting the word with
the highest probability in each particular slot.
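
The slot-wise selection described above can be illustrated with a
minimal Python sketch. The list-of-dictionaries representation,
the NULL marker and the example posteriors are assumptions made
for illustration only; they are not the data structures or values
used in the experiments.

```python
# Hypothetical marker for a null arc (no word emitted in this slot).
NULL = "<eps>"


def consensus_words(network):
    """Pick the highest-posterior entry in each slot, dropping null arcs."""
    words = []
    for slot in network:
        best = max(slot, key=slot.get)
        if best != NULL:
            words.append(best)
    return words


# Illustrative network with made-up posteriors.
example_network = [
    {"president": 0.9, "resident": 0.1},
    {"at": 0.6, NULL: 0.4},
    {"war": 0.7, "four": 0.3},
]

print(consensus_words(example_network))
```

Because each slot is decided on its own, the selected word string
need not match any single path in the original lattice or N-best
list.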

3  Tasks  &  Baseline 

This work specifically detects boundaries of sentence-like units
called SUs. An SU roughly corresponds to a sentence, except that
SUs are for the most part defined as units that include only one
independent main clause, and they may sometimes be incomplete, as
when a speaker is interrupted and does not complete their
sentence. A more specific annotation guideline for SUs is
available (Strassel, 2003), which we refer to as the "V5"
standard. In this work, we focus only on detecting SUs and do not
differentiate among the different types (e.g. statement,
question, etc.) that were used for annotation. We work with a
relatively new corpus and set of evaluation tools, which are
described below.

3.1  Corpora 

The system is evaluated for both conversational telephone speech
(CTS) and broadcast news (BN), in both cases using training,
development and test data annotated according to the V5 standard.
The test data is that used in the DARPA Rich Transcription (RT)
Fall 2003 evaluations; the development and evaluation test sets
together comprise the Spring 2003 RT evaluation test sets.

For CTS, there are 40 hours of conversations available for
training from the Switchboard corpus, and 3 hours (72
conversation sides) each of development and evaluation test data
drawn from both the Switchboard and Fisher corpora. The
development and evaluation sets each have roughly 6000 SUs.

The BN data consists of a set of 20 hours of news shows for
training, and 3 hours (6 shows) for testing. The development and
evaluation test sets each contain 1.5 hours (3 shows), each with
roughly 1000 SUs. Test data comes from the month of February in
2001; training data is taken from a previous time period.


3.2  Baseline  System 

The automatic speech recognition systems used were updated
versions of those used by SRI in the Spring 2003 RT evaluations
(NIST, 2003), with a WER of 12.1% on BN data and 22.9% on CTS
data. Both systems perform multiple recognition and adaptation
passes, and eventually produce up to 2000-best hypotheses per
waveform segment, which are then rescored with a number of
knowledge sources, such as higher-order language models,
pronunciation scores, and duration models (for CTS). For best
results, the systems combine decoding output from multiple front
ends, each producing a separate N-best list. All N-best lists for
the same waveform segment are then combined into a single word
confusion network (Mangu et al., 2000) from which the hypothesis
with lowest expected word error is extracted. In our baseline SU
system, the single best word stream thus obtained is then used as
the basis for SU recognition.

Our baseline SU system builds on previous work on sentence
boundary detection using lexical and prosodic features (Shriberg
et al., 2000). The system takes as input alignments from either
reference or recognized (1-best) words, and combines lexical and
prosodic information using an HMM. Prosodic features include
about 100 features reflecting pause, duration, F0, energy, and
speaker change information. The prosody model is a decision tree
classifier that generates the posterior probability of an SU
boundary at each interword boundary given the prosodic features.
Trees are trained from sampled training data in order to make the
model sensitive to features of the minority SU class. Recent
prosody model improvements include the use of bagging techniques
in decision tree training to reduce the variability due to a
single tree (Liu et al., 2003). Language model improvements
include adding information from a POS-based model, a model using
automatically-induced word classes, and a model trained on
separate data.

3.3  Evaluation 

Errors are measured by a slot error rate similar to the WER
metric utilized by the speech recognition community, i.e.
dividing the total number of inserted and deleted SUs by the
total number of reference SUs. (There are no substitution errors
because there is only one sentence class.) When recognition
output is used, the words will generally not align perfectly with
the reference transcription and hence the SU boundary predictions
will require some alignment procedure to match to the reference
location. Here, the alignment is based on the minimum word error
alignment of the reference and hypothesized word strings, and the
minimum SU error alignment if the WER is equal for multiple
alignments. We report numbers computed with the su-eval scoring
tool from NIST. SU error rates for the reference words condition
of our baseline system are 49.04% for BN, and 30.13% for CTS, as
reported at the NIST RT03F evaluation (Liu et al., 2003). Results
for the automatic speech recognition condition are described in
Section 5.
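
The slot error rate defined above can be written out as a short
Python sketch. This is only the final ratio, assuming insertion
and deletion counts have already been obtained from an alignment;
it is not the su-eval tool, and the function name and example
counts are made up for illustration.

```python
def su_slot_error_rate(insertions, deletions, reference_sus):
    """Return (insertions + deletions) / reference SUs as a percentage."""
    if reference_sus <= 0:
        raise ValueError("reference_sus must be positive")
    return 100.0 * (insertions + deletions) / reference_sus


# Made-up counts for illustration only.
rate = su_slot_error_rate(insertions=50, deletions=150, reference_sus=1000)
print(f"{rate:.2f}%")
```

Since insertions are counted against the number of reference SUs,
this rate can exceed 100% when a system inserts many spurious
boundaries.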

4  Using  N-Best  Sentence  Hypotheses 

The large increase in SU detection error rate in moving from
reference to recognizer transcripts motivates an approach that
reduces the mistakes introduced by word recognition errors.
Although the best recognizer output is optimized to reduce word
error rate, alternative hypotheses may together reinforce
alternative (more accurate) SU predictions. The oracle WER for
the confusion networks is much lower than for the single best
hypothesis, in the range of 13-16% WER for the CTS test sets.

4.1  Feature  Extraction  and  SU  Detection 

Prediction of SUs using multiple hypotheses requires prosodic
feature extraction for each hypothesis, which in turn requires a
forced alignment of each hypothesis. Thousands of hypotheses are
output by the recognizer, but we prune to a smaller set to reduce
the cost of running forced alignments and prosodic feature
extraction. The recognizer outputs an N-best list of hypotheses
and assigns a posterior probability to each hypothesis, which is
normalized to sum to 1 over all hypotheses. We collect hypotheses
from the N-best list for each acoustic segment up to 90% of the
posterior mass (or to a maximum count of 1000).
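
One way to realize this pruning rule is sketched below in Python.
The function name, the (words, posterior) pair representation and
the choice to keep the hypothesis that crosses the 90% threshold
are assumptions for illustration; the text does not specify how
the boundary case was handled.

```python
def prune_nbest(hypotheses, mass=0.90, max_count=1000):
    """Keep top hypotheses until `mass` of the posterior is covered.

    `hypotheses` is a list of (words, posterior) pairs whose posteriors
    sum to 1; at most `max_count` hypotheses are kept.
    """
    ranked = sorted(hypotheses, key=lambda h: h[1], reverse=True)
    kept, covered = [], 0.0
    for words, posterior in ranked:
        if covered >= mass or len(kept) >= max_count:
            break
        kept.append((words, posterior))
        covered += posterior
    return kept


# Made-up N-best list for one acoustic segment.
nbest = [
    ("the president at war", 0.50),
    ("the resident at war", 0.30),
    ("the president at four", 0.15),
    ("a president at war", 0.05),
]
print(len(prune_nbest(nbest)))
```

With these made-up posteriors the first three hypotheses are
kept, since the first two cover only 80% of the mass.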

Next, forced alignment and prosodic feature extraction are run
for all segments in this pruned set of hypotheses. Statistics for
prosodic feature normalization (such as speaker and turn F0 mean)
are collected from the single best hypothesis. After obtaining
the prosodic features, the HMM predicts sentence boundaries for
each word sequence hypothesis independently. For each hypothesis,
an SU prediction is made at all word boundaries, resulting in a
posterior probability for SU and no_SU at each boundary. The same
models are used as in the 1-best predictions; no parameters were
re-optimized for the N-best framework. Given independent
predictions for the individual hypotheses, we then build a system
to incorporate the multiple predictions into a single hypothesis,
as described next.

4.2  Combining  Hypotheses 

The prediction results for an individual hypothesis are
represented in a confusion network that consists of a series of
word slots, each followed by a slot with SU and no_SU, as shown
in Figure 1 with hypothetical confidences for the between-word
events. (This representation is a somewhat unusual form because
the word slots have only a single hypothesis.) The words in the
individual hypotheses have probability one, and each arc with an
SU or no_SU token has a confidence (posterior probability)
assigned from the HMM. The overall network has a score associated
with its N-best hypothesis-level posterior probability, scaled by
a weight corresponding to the goodness of the system that
generated that hypothesis.

[Figure: word slots "president", "at", "war", each followed by
a slot with SU and no_SU arcs]

Figure 1: Confusion network for a single hypothesis.

The confusion networks for each hypothesis are then merged with
the SRI Language Modeling Toolkit (Stolcke, 2002) to create a
single confusion network for an overall hypothesis. This
confusion network is derived from an alignment of the confusion
networks of each individual hypothesis. The resulting network
contains slots with the word hypotheses from the N-best list and
slots with the combined SU/no_SU probability, as shown in
Figure 2. The confidences assigned to each token in the new
confusion network are a weighted linear combination of the
probabilities from individual hypotheses that align to each
other, compiled from the entire hypothesis list, where the
weights are the hypothesis-level scores from the recognizer.

[Figure: word slots "president", "at", "war" interleaved with
SU / no_SU slots]

Figure 2: Confusion network for a merged hypothesis.

Finally, the best decision at each point is selected by choosing
the words and boundaries with the highest probability. Here, the
words and SUs are selected independently, so that we obtain the
same words as would be selected without inserting the SU tokens
and guarantee no degradation in WER. The key improvement is that
the SU detection is now a result of detection across all
recognizer hypotheses, which reduces the effect of word errors in
the top hypothesis.
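
The combination and selection steps of this section can be
sketched in Python as follows. This is a simplified illustration
under several assumptions: the alignment of boundary slots across
hypotheses (done with the SRI Language Modeling Toolkit in our
system) is taken as given via shared slot indices, the combined
score is normalized by the total weight of the hypotheses aligned
to a slot, and all names and numbers are hypothetical.

```python
def merge_su_posteriors(hypotheses):
    """Combine per-hypothesis SU posteriors slot by slot.

    `hypotheses` is a list of (weight, {slot_index: p_su}) pairs, where
    the slot indices are assumed to come from a prior alignment step.
    Returns {slot_index: combined p_su}, normalized by the total weight
    of the hypotheses that have a boundary aligned to that slot.
    """
    weighted_sum, weight_total = {}, {}
    for weight, p_su_by_slot in hypotheses:
        if weight <= 0:
            continue  # ignore hypotheses that carry no posterior mass
        for slot, p_su in p_su_by_slot.items():
            weighted_sum[slot] = weighted_sum.get(slot, 0.0) + weight * p_su
            weight_total[slot] = weight_total.get(slot, 0.0) + weight
    return {s: weighted_sum[s] / weight_total[s] for s in weighted_sum}


def decide_boundaries(combined, threshold=0.5):
    """Label a slot SU when its combined posterior exceeds the threshold."""
    return {s: ("SU" if p > threshold else "no_SU") for s, p in combined.items()}


# Two made-up hypotheses with recognizer posteriors 0.6 and 0.4.
hyps = [
    (0.6, {0: 0.2, 1: 0.1, 2: 0.9}),
    (0.4, {0: 0.7, 1: 0.2, 2: 0.6}),
]
combined = merge_su_posteriors(hyps)
print(decide_boundaries(combined))
```

With only two competing tokens per boundary slot, choosing the
token with the highest probability is the same as thresholding
the combined SU posterior at 0.5. The word slots are decided
separately, so the word string is unaffected by this step.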

5  Experiments 

Table 1 shows the results in terms of slot error rate on the four
test sets. The middle column indicates the performance on a
single hypothesis, with the words derived from the pruned set of
N-best hypotheses. The right column indicates the performance of
the system using multiple hypotheses merged with confusion
networks.

                        SU error rate
            WER (%)     Single Best    Confusion Nets
BN Dev      12.2        55.79%         54.45%
BN Eval     12.0        57.78%         58.42%
CTS Dev     23.6        44.14%         42.72%
CTS Eval    22.2        44.95%         44.01%

Table 1: Word and SU error rates for single best vs. confusion
nets.

Multiple hypotheses provide a reduction of error for both test
sets of CTS (significant at p<.02 using the McNemar test), but
give insignificant (and mixed) results for BN. The small increase
in error for the BN evaluation set may be due to the fact that
the 1-best parameters were tuned on different news shows than
were represented in the evaluation data.
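
For readers unfamiliar with the McNemar test, the following
Python sketch computes a two-sided exact p-value from the two
discordant counts of a paired comparison. It is a generic
illustration only: the text does not state which variant of the
test was used or how decisions were paired, and the counts in the
example are made up.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: decisions system A got right and system B got wrong
    c: decisions system A got wrong and system B got right
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial tail probability under the null hypothesis p = 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Made-up discordant counts for illustration only.
print(f"p = {mcnemar_exact(1, 9):.4f}")
```

Only the decisions on which the two systems disagree enter the
test; boundaries that both systems get right or both get wrong do
not affect the p-value.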

We expected a greater gain from the use of confusion networks in
CTS than BN, given the previously shown impact of WER on 1-best
SU detection. Additionally, incorporating a larger number of
N-best hypotheses has improved results in all experiments so far,
so we would expect this trend to continue for additional
increases, but time constraints limited our ability to run these
larger experiments. One possible explanation for the relatively
small performance gains is that we constrained the confusion
network topology so that there was no change in the word
recognition results. We imposed this constraint in our initial
investigations to allow us to compare performance using the same
words. It is possible that better performance could be obtained
by using confusion network topologies that link words and
metadata.

A more specific breakout of error improvement for the CTS
development set is given in Table 2, showing that both recall and
precision benefit from using the N-best framework. Including
multiple hypotheses reduces the number of SU deletions (improves
recall), but the primary gain is in reducing insertion errors
(higher precision). The same effect holds for the CTS evaluation
set.


              Single Best    Confusion Nets    Change
Deletions     1623           1597              -1.6%
Insertions    872            818               -6.2%
Total         2495           2415              -3.2%

Table 2: Errors for CTS development set
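
The Change column of Table 2 is the relative change in error
counts from the single best system to the confusion network
system. The short Python sketch below reproduces that arithmetic
from the counts in the table; the function and variable names are
ours and are used only for illustration.

```python
def relative_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


# Counts from Table 2 (CTS development set).
single_best = {"Deletions": 1623, "Insertions": 872}
confusion_nets = {"Deletions": 1597, "Insertions": 818}
single_best["Total"] = sum(single_best.values())
confusion_nets["Total"] = sum(confusion_nets.values())

for name in single_best:
    change = relative_change(single_best[name], confusion_nets[name])
    print(f"{name:<10} {single_best[name]:>5} {confusion_nets[name]:>5} {change:+.1f}%")
```

Insertions fall by 54 while deletions fall by 26, which is why
the larger relative gain appears on the precision side.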


6  Conclusion 

Detecting sentence structure in automatic speech recognition
provides important information for language processing or human
understanding. Incorporating multiple hypotheses from word
recognition output can improve overall detection of SUs in
comparison to prediction on a single hypothesis. This is
especially true for CTS, which suffers more from word errors and
can therefore benefit from considering alternative hypotheses.

Future work will involve a tighter integration of SU detection
and word recognition by including SU events directly in the
recognition lattice. This will provide opportunities to
investigate the interaction of automatic word recognition and
structural metadata, hopefully resulting in reduced WER. We also
plan to extend these methods to additional tasks such as
disfluency detection.